Abstract: Clustering mixed type data is one of the major research topics in data mining. In
this paper, a new algorithm for clustering mixed type data is proposed in which a distribution
centroid represents the prototype of the categorical variables in a cluster; it is combined with
the mean to represent the prototype of clusters with mixed type variables. In the method, the data
are observed from different views and the variables are grouped accordingly. Instances that can be
described from several viewpoints are called multiview data. Conventional clustering usually
ignores the differences among views. Here, view weights and variable weights are computed
simultaneously. The view weight measures the closeness or density of a view, and the variable
weight identifies the significance of each variable. Both weights are used in the distance
function that determines the cluster of each object. The proposed method enhances k-prototypes so
that it computes both view and variable weights automatically. The proposed algorithm,
MK-Prototypes, is compared with two other clustering algorithms.
I. Introduction

Clustering is a fundamental technique of unsupervised learning in machine learning and statistics. It is
generally used to find groups of similar items in a set of unlabeled data. The aim of clustering is to divide a set
of data objects into clusters so that data objects that belong to the same cluster are more similar to each
other than to those in other clusters [1-4]. In the real world, datasets usually contain both numeric and
categorical variables [5,6]. However, most existing clustering algorithms assume that all variables are either
numeric or categorical; examples include the k-means [7], k-modes [8] and fuzzy k-modes [9] algorithms. Here,
the data are observed from multiple viewpoints and described by multiple types of variables. For example, in a
student data set, variables can be divided into a personal information view describing the student's personal
details, an academic view describing the student's academic performance, and an extra-curricular view
covering the student's extra-curricular activities and achievements.

Traditional methods treat multiple views as a flat set of variables and do not take into account the
differences among views [10], [11], [12]. Multiview clustering uses the information from multiple views and
also considers the variations among them, which produces a more precise and efficient partitioning of the
data.

In this paper, a new algorithm, Multi-viewpoint K-prototypes (MK-Prototypes), is proposed for clustering
mixed type data. It is an enhancement of the usual k-prototypes algorithm. To differentiate the effects of
different views and different variables in clustering, view weights and variable weights are applied to the
distance function. When computing the view weights, the complete set of variables is considered; when
calculating the weights of the variables in a view, only the part of the data that includes the variables of
that view is considered. Thus, the view weights show the significance of the views in the complete data,
and the variable weights in a view show the significance of the variables within that view alone.

II. Related Works 

To date, a number of algorithms and methods deal directly with mixed type data. In [13],
Cen Li and Gautam Biswas proposed Similarity-Based Agglomerative Clustering (SBAC), an algorithm that
works well for data with mixed attributes. It adopts a similarity measure proposed by Goodall [14] for biological
taxonomy. When computing the similarity, this measure assigns a higher weight to matches on infrequent
attribute values. It makes no assumptions about the underlying properties of the attribute values. An
agglomerative algorithm is used to generate a dendrogram, and a simple distinctness heuristic is used to extract
a partition of the data. Hsu and Chen proposed CAVE [15], a clustering algorithm based on variance and entropy for

| IJMER | ISSN: 2249-6645 | www.ijmer.com | Vol. 4 | Iss. 4 | Apr. 2014 | 55 | 



MK-Prototypes: A Novel Algorithm for Clustering Mixed Type Data 



clustering mixed data. It builds a distance hierarchy for every categorical attribute, which requires domain
expertise. Hsu et al. [16] proposed an extension of the self-organizing map to analyze mixed data, in which the
distance hierarchy is constructed automatically from the values of the class attributes.

In [17], Chatzis proposed the KL-FCM-GM algorithm, which assumes that the data derived from each cluster
are Gaussian in form and is designed for Gauss-multinomial distributed data.

Huang presented the k-prototypes algorithm [18], which integrates k-means with k-modes to partition
mixed data. Bezdek et al. considered the fuzzy nature of the objects in their fuzzy k-prototypes algorithm [19],
and Zheng et al. [20] proposed an evolutionary k-prototypes algorithm by introducing an evolutionary
algorithm framework.

III. Proposed System 

The proposed system has two motivations. The first is to provide a better representation for the
categorical part of mixed data, since the numerical variables are already well represented by the mean.
The second is to account for the importance of view weights and variable weights in the clustering process.
The distribution centroid represents the cluster centroid for the categorical variables.
Huang's approach is used to compute both the view weights and the variable weights.

A. The distribution centroid 

The idea of a distribution centroid as a better representation of categorical variables is inspired by the
fuzzy centroid proposed by Kim et al. [21], which uses a fuzzy scheme to represent the cluster centers for
the categorical variables.

For $Dom(V_j) = \{b_j^1, b_j^2, \ldots, b_j^t\}$, the distribution centroid of a cluster $o$, denoted $C'_o$, is represented as follows:

$$C'_o = \{c'_{o1}, c'_{o2}, \ldots, c'_{oj}, \ldots, c'_{om}\} \tag{1}$$

where

$$c'_{oj} = \{\{b_j^1, w_{oj}^1\}, \{b_j^2, w_{oj}^2\}, \ldots, \{b_j^k, w_{oj}^k\}, \ldots, \{b_j^t, w_{oj}^t\}\} \tag{2}$$

In the above equation,

$$w_{oj}^k = \sum_{i=1}^{n} \kappa(x_{ij}) \tag{3}$$

where

$$\kappa(x_{ij}) = \begin{cases} \dfrac{u_{io}}{\sum_{i=1}^{n} u_{io}} & \text{if } x_{ij} = b_j^k \\ 0 & \text{if } x_{ij} \neq b_j^k \end{cases} \tag{4}$$

Here, $u_{io}$ is assigned the value 1 if the data object $x_i$ belongs to cluster $o$, and 0 if it does not.

From the above equations it is clear that the computation of the distribution centroid considers the
number of times each categorical value occurs in a cluster. Thus, to denote the center of a cluster, it takes
into account the distribution of the categorical variables.
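
As a rough illustration of how a distribution centroid could be computed for one categorical variable, the sketch below counts category occurrences inside a cluster and divides by the cluster size. It assumes that each weight is the relative frequency of the category among the cluster members; the function name `distribution_centroid` and the toy data are hypothetical and are not part of the original description.

```python
from collections import Counter


def distribution_centroid(column, membership):
    """Return {category: weight} for one categorical variable in one cluster.

    column     -- list of category values, one per data object
    membership -- list of 0/1 flags (1 if the object belongs to the cluster)
    Assumption of this sketch: the weight of a category is its relative
    frequency among the members of the cluster.
    """
    cluster_size = sum(membership)
    if cluster_size == 0:
        raise ValueError("cluster has no members")
    counts = Counter(value for value, u in zip(column, membership) if u == 1)
    # Every category of the domain gets an entry; absent categories get 0.
    return {b: counts.get(b, 0) / cluster_size for b in sorted(set(column))}


# Toy data: five objects, the first three belong to the cluster.
colours = ["red", "blue", "red", "green", "red"]
in_cluster = [1, 1, 1, 0, 0]
centroid = distribution_centroid(colours, in_cluster)
print(centroid)
```

The weights of one variable sum to 1 over its domain, so the centroid can be read as an empirical distribution of the categories inside the cluster.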

B. Weight calculation using Huang's approach 

The weight of a variable identifies the effect of that variable on the clustering process. In 2005, Huang et al.
proposed an approach to calculate variable weights [22]. In their method, the weights are
computed by minimizing the value of the objective function.

The principle for assigning a variable weight is to allocate a larger value to a variable that has a
smaller sum of within-cluster distances (WCD), and vice versa. This principle is given by

$$w_j \propto \frac{1}{D_j} \tag{5}$$

where $w_j$ is the weight of variable $j$, $\propto$ denotes direct proportionality, and
$D_j$ is the sum of the within-cluster distances for this variable.
C. Multiview concept
FIGURE 1: Multiview concept

In 2013, Chen et al. [23] proposed TW-k-means, in which the concept of multiview data was introduced.
Figure 1 illustrates the multiview concept. Conventional clustering does not consider the differences among
views. In multiview clustering, in addition to the variable weights, the variables are grouped according to
their characteristic properties. Each group is termed a view, and a weight is assigned to each view. The view
weight is assigned according to Huang's approach.

D. The proposed algorithm 

The proposed algorithm, MK-Prototypes, combines the concepts of Sections III-A, III-B and III-C.
Figure 2 describes the steps involved in the algorithm.
Steps in the proposed algorithm:

1. Compute the distribution centroid to represent the categorical variable centroid

2. Compute the mean for the numerical variables 

3. Integrate the distribution centroid and mean to represent the prototype for the mixed data 

4. Compute the view weights and variable weights. 

5. Measure the similarity between the data objects and the prototypes 

6. Assign each data object to the prototype to which it is closest

7. Repeat steps 1-6 until an effective clustering result is obtained. 

E. The optimization model 

The clustering process that partitions the dataset X into k clusters while considering both view weights
and variable weights is represented, following the framework of [23], as the minimization of the following
objective function:

$$P(U,Z,R,V) = \sum_{o=1}^{k}\sum_{i=1}^{n}\sum_{t=1}^{Q}\sum_{s \in G_t} u_{i,o}\, v_t\, r_s\, d(x_{i,s}, z_{o,s}) \tag{6}$$

subject to

$$\sum_{o=1}^{k} u_{i,o} = 1, \quad u_{i,o} \in \{0,1\}, \quad 1 \le i \le n$$

$$\sum_{t=1}^{Q} v_t = 1, \quad 0 \le v_t \le 1, \quad 0 \le r_s \le 1, \quad 1 \le t \le Q$$

where $U$ is an $n \times k$ partition matrix whose elements $u_{i,o}$ are binary, with $u_{i,o} = 1$ indicating that
object $i$ is allocated to cluster $o$. $Z = \{Z_1, Z_2, \ldots, Z_k\}$ is a set of $k$ vectors representing the centers
of the $k$ clusters. $V = \{v_1, v_2, \ldots, v_Q\}$ are $Q$ weights for $Q$ views. $R = \{r_1, r_2, \ldots, r_s\}$ are $s$
weights for $s$ variables. $d(x_{i,s}, z_{o,s})$ is a distance or dissimilarity measure on the $s$-th variable between
the $i$-th object and the center of the $o$-th cluster.
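
To make the structure of the objective concrete, the following minimal sketch evaluates the weighted within-cluster cost for a given partition, set of prototypes and weights. The function name `objective`, the representation of views as lists of variable indices, and the use of the absolute difference as the per-variable distance in the example are assumptions made for illustration only.

```python
def objective(X, Z, U, views, v, r, dist):
    """Evaluate the weighted within-cluster cost P(U, Z, R, V).

    X     -- list of n objects, each a list of variable values
    Z     -- list of k prototypes with the same layout as an object
    U     -- n x k binary partition matrix (U[i][o] == 1: object i in cluster o)
    views -- list of Q lists; views[t] holds the variable indices of view G_t
    v, r  -- view weights (length Q) and variable weights (one per variable)
    dist  -- callable dist(x_value, z_value) for a single variable
    """
    total = 0.0
    for o, z in enumerate(Z):
        for i, x in enumerate(X):
            if U[i][o] == 0:
                continue  # object i does not belong to cluster o
            for t, members in enumerate(views):
                for s in members:
                    total += v[t] * r[s] * dist(x[s], z[s])
    return total


# Toy example with two numeric variables, each forming its own view.
X = [[1.0, 2.0], [3.0, 4.0]]
Z = [[1.0, 2.0], [3.0, 5.0]]
U = [[1, 0], [0, 1]]
views = [[0], [1]]
P = objective(X, Z, U, views, v=[0.5, 0.5], r=[1.0, 1.0],
              dist=lambda a, b: abs(a - b))
print(P)
```

Passing the per-variable distance as a callable keeps the sketch independent of how numeric and categorical variables are compared.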
FIGURE 2. Flowchart for the proposed algorithm

In order to minimize equation (6), the problem is divided into four sub-problems:

1. Sub-problem 1: Fix $Z = \hat{Z}$, $R = \hat{R}$ and $V = \hat{V}$ and solve the reduced problem $P(U, \hat{Z}, \hat{R}, \hat{V})$.

2. Sub-problem 2: Fix $U = \hat{U}$, $R = \hat{R}$ and $V = \hat{V}$ and solve the reduced problem $P(\hat{U}, Z, \hat{R}, \hat{V})$.

3. Sub-problem 3: Fix $Z = \hat{Z}$, $U = \hat{U}$ and $V = \hat{V}$ and solve the reduced problem $P(\hat{U}, \hat{Z}, R, \hat{V})$.

4. Sub-problem 4: Fix $Z = \hat{Z}$, $R = \hat{R}$ and $U = \hat{U}$ and solve the reduced problem $P(\hat{U}, \hat{Z}, \hat{R}, V)$.



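Sub-problem 1 is solved by

$$u_{i,o} = 1 \quad \text{if} \quad \sum_{s=1}^{m} v_t\, r_s\, d(x_{i,s}, z_{o,s}) \le \sum_{s=1}^{m} v_t\, r_s\, d(x_{i,s}, z_{e,s}), \quad \text{where } 1 \le e \le k \tag{7}$$

$$u_{i,e} = 0 \quad \text{where } e \neq o \tag{8}$$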
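
The assignment step can be sketched as a nearest-prototype search under the weighted distance. This is a minimal illustration, assuming numeric variables and the absolute difference as the per-variable distance; the name `assign_clusters` and the tie-breaking rule (lowest cluster index wins) are choices of this sketch, not statements about the original implementation.

```python
def assign_clusters(X, Z, views, v, r, dist):
    """Return, for each object, the index of the prototype with the lowest weighted cost."""
    # Map every variable index to the index of the view that contains it.
    view_of = {s: t for t, members in enumerate(views) for s in members}
    labels = []
    for x in X:
        costs = [
            sum(v[view_of[s]] * r[s] * dist(x[s], z[s]) for s in range(len(x)))
            for z in Z
        ]
        # Ties are resolved in favour of the lowest cluster index.
        labels.append(costs.index(min(costs)))
    return labels


X = [[1.0, 10.0], [8.0, 2.0], [2.0, 9.0]]
Z = [[1.5, 9.5], [8.0, 2.0]]
labels = assign_clusters(X, Z, views=[[0], [1]], v=[0.5, 0.5], r=[1.0, 1.0],
                         dist=lambda a, b: abs(a - b))
print(labels)
```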
Sub-problem 2 is solved for the numeric variables by

$$z_{o,s} = \frac{\sum_{i=1}^{n} u_{i,o}\, x_{i,s}}{\sum_{i=1}^{n} u_{i,o}} \tag{9}$$

and for the categorical variables by $z_{o,s} = c'_{os}$, which is already defined. The distance is

$d(x_{i,s}, z_{o,s}) = |x_{i,s} - z_{o,s}|$ if the $s$-th variable is a numeric variable,

$d(x_{i,s}, z_{o,s}) = \varphi(x_{i,s}, z_{o,s})$ if the $s$-th variable is a categorical variable,

where $\varphi(x_{i,s}, z_{o,s}) = \sum_{k=1}^{t} \delta(x_{i,s}, b_s^k)$, and $\delta(x_{i,s}, b_s^k)$ is 0 if $x_{i,s} \neq b_s^k$ and $w_{os}^k$ if $x_{i,s} = b_s^k$.

The solution to sub-problem 3 is as follows.

Let $Z = \hat{Z}$, $U = \hat{U}$ and $V = \hat{V}$ be fixed. Then the reduced problem $P(\hat{U}, \hat{Z}, R, \hat{V})$ is minimized if

$$r_s = \frac{1}{\sum_{h \in G_t} \left[ \dfrac{D_s}{D_h} \right]^{1/(\mu - 1)}} \tag{10}$$

where

$$D_s = \sum_{o=1}^{k}\sum_{i=1}^{n} u_{i,o}\, v_t\, d(x_{i,s}, z_{o,s}) \tag{11}$$

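Sub-problem 4 is solved as follows:

$$v_t = \frac{1}{\sum_{h=1}^{Q} \left[ \dfrac{E_t}{E_h} \right]^{1/(\gamma - 1)}} \tag{12}$$

where

$$E_t = \sum_{o=1}^{k}\sum_{i=1}^{n}\sum_{s \in G_t} u_{i,o}\, r_s\, d(x_{i,s}, z_{o,s}) \tag{13}$$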
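
The weight updates of sub-problems 3 and 4 can be sketched together. The code below is a hedged illustration that assumes both updates take the form of the weighting rule of Huang et al. [22], in which a weight is the reciprocal of a sum of ratios of within-cluster distance sums raised to the power 1/(exponent - 1), with μ as the exponent for variables and γ for views. The helper names `huang_weights`, `variable_weights` and `view_weights`, and the handling of zero sums, are assumptions of this sketch.

```python
def huang_weights(sums, exponent):
    """Weights that decrease with the within-cluster distance sums.

    Assumed rule: w_j = 1 / sum_h (D_j / D_h) ** (1 / (exponent - 1)),
    with exponent > 1. A zero sum receives weight 0 and is left out of
    the denominators (an assumption of this sketch).
    """
    if exponent <= 1:
        raise ValueError("exponent must be greater than 1")
    power = 1.0 / (exponent - 1.0)
    positive = [d for d in sums if d > 0]
    return [
        0.0 if d == 0 else 1.0 / sum((d / h) ** power for h in positive)
        for d in sums
    ]


def variable_weights(D, views, mu):
    """Normalise the variable weights separately inside each view."""
    r = [0.0] * len(D)
    for members in views:
        local = huang_weights([D[s] for s in members], mu)
        for s, w in zip(members, local):
            r[s] = w
    return r


def view_weights(E, gamma):
    """Normalise the view weights over all views."""
    return huang_weights(E, gamma)


views = [[0, 1], [2]]          # two views over three variables
D = [1.0, 3.0, 5.0]            # per-variable distance sums
r = variable_weights(D, views, mu=2.0)
v_sharp = view_weights([1.0, 4.0], gamma=2.0)
v_flat = view_weights([1.0, 4.0], gamma=3.0)
print(r, v_sharp, v_flat)
```

In this toy run the view weights move closer together when the exponent grows from 2 to 3, which matches the qualitative behaviour reported for large parameter values.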

Having presented the computations required for each sub-problem, the proposed algorithm MK-Prototypes
can be described as follows:

1. Choose the number of iterations, the number of clusters k and the values of μ and γ; randomly choose k
distinct data objects and convert them into initial prototypes; initialize the view weights and variable weights.

2. Fix $\hat{Z}$, $\hat{R}$, $\hat{V}$ as $Z^t$, $R^t$, $V^t$ respectively and minimize the problem $P(U, \hat{Z}, \hat{R}, \hat{V})$ to obtain $U^{t+1}$.

3. Fix $\hat{U}$, $\hat{R}$, $\hat{V}$ as $U^t$, $R^t$, $V^t$ respectively and minimize the problem $P(\hat{U}, Z, \hat{R}, \hat{V})$ to obtain $Z^{t+1}$.

4. Fix $\hat{U}$, $\hat{Z}$, $\hat{V}$ as $U^t$, $Z^t$, $V^t$ respectively and minimize the problem $P(\hat{U}, \hat{Z}, R, \hat{V})$ to obtain $R^{t+1}$.

5. Fix $\hat{U}$, $\hat{Z}$, $\hat{R}$ as $U^t$, $Z^t$, $R^t$ respectively and minimize the problem $P(\hat{U}, \hat{Z}, \hat{R}, V)$ to obtain $V^{t+1}$.

6. If there is no improvement in P or if the maximum number of iterations is reached, then stop. Otherwise,
increment t by 1, decrement the number of remaining iterations by 1 and go to Step 2.

IV. Experiments on Performance of MK-Prototypes Algorithm

To measure the performance of the proposed algorithm, it is used to cluster a real-world dataset,
Heart (disease), taken from the UCI Machine Learning Repository.

The proposed algorithm is compared with the k-prototypes and SBAC algorithms, both well known for
clustering mixed type data. In this paper, the clustering accuracy is measured using one of the most commonly
used criteria. The clustering accuracy r is given by

$$r = \frac{\sum_{i=1}^{k} a_i}{n} \tag{14}$$

where $a_i$ is the number of data objects that occur in both the $i$-th cluster and its corresponding true class,
and $n$ is the number of data objects in the data set.

The higher the value of r, the higher the clustering accuracy. A perfect clustering gives a value of r = 1.0.
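
The accuracy measure can be computed with a few lines of code. The sketch below assumes that the "corresponding true class" of a cluster is its majority class, which is a common convention but is not stated explicitly in the text; the function name `clustering_accuracy` and the toy labels are hypothetical.

```python
from collections import Counter


def clustering_accuracy(cluster_labels, true_classes):
    """Fraction of objects that fall in the majority true class of their cluster.

    Assumption of this sketch: each cluster is matched to the true class
    that occurs most often among its members.
    """
    n = len(cluster_labels)
    if n == 0 or n != len(true_classes):
        raise ValueError("inputs must be non-empty and of equal length")
    per_cluster = {}
    for c, y in zip(cluster_labels, true_classes):
        per_cluster.setdefault(c, Counter())[y] += 1
    # a_i is the size of the majority class inside cluster i.
    matched = sum(max(counts.values()) for counts in per_cluster.values())
    return matched / n


clusters = [0, 0, 0, 1, 1, 1]
truth = ["sick", "sick", "healthy", "healthy", "healthy", "sick"]
accuracy = clustering_accuracy(clusters, truth)
print(accuracy)
```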

A. Dataset description 

The Heart disease data set is a mixed dataset. It contains 303 patient instances. The original data set
contains 76 variables, of which 14 are usually considered. For the proposed algorithm, 19 of the 76 variables
are used in order to define three views: seven numeric variables and twelve categorical variables.

These 19 variables can be naturally divided into 3 views.

1. Personal data view: It includes those variables which describe a patient's personal data.

2. Historical data view: It includes those variables which describe a patient's historical data, such as habits.

3. Test output view: It includes all those variables which describe the results of the various tests conducted
for the patient.

Here, $G_1$, $G_2$, $G_3$ represent the three views personal, historical and test output, respectively.

B. Results and analysis 

Below are the graphical representations of the clustering results. Fig 3 shows the variation in variable
weights for varying μ values and fixed γ values. Fig 4 shows the variation in view weights for varying μ values
and fixed γ values.

From Table 1, it is observed that as μ increased, the variance of the variable weights R decreased rapidly.
This result can be explained from equation (10): as μ increases, R becomes flatter. The graphical
representation of Table 1 is shown below.

Table 1: Variable weights vs γ value for fixed μ value

Table 2 shows that as γ increased, the variance of the view weights V decreased rapidly. This result can be
explained from equation (11): as γ increases, V becomes flatter. The graphical representation is shown below.
Fig 3: Variable weights vs γ value for fixed μ value

Table 2: View weights vs μ value for fixed γ value

Fig 4: View weights vs μ value for fixed γ value

The experiments were conducted with three different fixed values of μ while γ was varied, and with three
different fixed values of γ while μ was varied.

From the above analysis, the two types of weight distributions in the MK-Prototypes algorithm can be
controlled by setting different values of γ and μ:

1. A large μ makes more variables contribute to the clustering, while a small μ makes only important variables
contribute to the clustering.

2. A large γ makes more views contribute to the clustering, while a small γ makes only important views
contribute to the clustering.

Table 3: Comparison of accuracy rates of dataset considering all views

Algorithms        Clustering accuracy r
k-prototypes      0.521
SBAC              0.747
MK-Prototypes     0.846



From the above table, it is clear that the proposed algorithm has a better clustering accuracy than the 
existing k-prototypes and SBAC. 

V. Conclusion 

Mixed type data are encountered everywhere in the real world. In this paper, a new multi-viewpoint
based clustering algorithm for mixed type data, MK-Prototypes, has been proposed. Compared with the
existing algorithms, the proposed algorithm makes several contributions. It captures the characteristics of
clusters with mixed type variables more effectively, since it includes the distribution information of both
numeric and categorical variables.

It also takes into account the importance of the various variables and views during clustering
by using Huang's approach and a new dissimilarity measure.

It can compute weights for views and individual variables simultaneously in the clustering process.
With the two types of weights, dense views and significant variables can be identified, and the effect of
low-quality views and noise variables can be reduced.

Because of these contributions, the proposed algorithm obtains higher clustering accuracy than
k-prototypes and SBAC on the Heart disease dataset, as the experimental results show.